Reduced-Space On-Disk Suffix Arrays

Abstract: The suffix array is an efficient data structure for
in-memory pattern search. Suffix arrays can also be used for 
external-memory pattern search, via two-level structures that use 
an internal index to identify the correct block of suffix pointers. 
In this paper we describe a new two-level suffix array-based index 
structure that requires significantly less disk space than previous 
approaches. Key to the saving is the use of disk blocks that are 
based on prefixes rather than the more usual uniform-sampling 
approach, allowing reductions between blocks and subparts of 
other blocks. We also describe a new in-memory structure based 
on a condensed BWT string, and show that it allows common 
patterns to be resolved without access to the text. Experiments 
using 64 GB of English web text and a laptop computer with 
just 4 GB of main memory demonstrate the speed and versatility 
of the new approach. For this data the index is around one-third
the size of previous two-level mechanisms; and the memory
footprint of as little as 1% of the text size means that queries
can be processed more quickly than is possible with a compact
FM-INDEX.

Index Terms — String search, pattern matching, suffix array, 
Burrows-Wheeler transform, succinct data structure, disk-based 
algorithm, experimental evaluation. 

I. Introduction 

STRING search is a well known problem: given a text
T[0 ... n - 1] over some alphabet Σ of size σ = |Σ|,
and a pattern P[0 ... m - 1], locate the occurrences of P in
T. Several different query modes are possible: asking whether
or not P occurs (existence queries); asking how many times
P occurs (count queries); asking for the byte locations in T
at which P occurs (locate queries); and asking for a set of
extracted contexts of T that includes each occurrence of P
(context queries).

When T and P are provided on a one-off basis, sequential
pattern search methods take O(n + m) time. When T is
fixed, and many patterns are to be processed, it is likely to
be more efficient to pre-process T and construct an index.
The suffix array [1] is one such index, allowing locate queries
to be answered in O(m + log n + k) time when there are
k occurrences of P in T, using O(n log n) bits of space in
addition to T. Further alternatives are discussed in Section II.
Suffix arrays only provide efficient querying if T plus the
index require less main memory than is available on the host
computer, because random accesses are required to the index
and the text. For large texts, two-tier structures are needed,
with an in-memory component consulted first in order to
identify the data that must be retrieved from an on-disk index.

A. Our Contributions 

We show that if the usual fixed-interval sampling approach 
to creating the in-memory index for a two-level suffix array is 
replaced by a sampling method that respects common prefixes, 
the space required by the suffix array blocks on disk can 
be reduced by as much as 50%. This gain is achieved by 
identifying reducible blocks that can be replaced by references 
to subintervals within other on-disk blocks. 

We also describe a new in-memory structure for indexing 
variable-length common-prefix blocks that is comparable in 
size to the bit-blind tree. In terms of operational functionality, 
the new structure has the benefit of being comprehensive, 
meaning that existence and count searches for
frequently-occurring patterns can be resolved without disk
accesses. The new approach employs backward searching and the
Burrows-Wheeler Transform.

The methodology developed in order to carry out the 
experimentation allows independent and stratified exploration 
of patterns according to their length and their frequency, and 
is a third key contribution of this paper. Experiments using 
64 GB of English web text and a laptop computer with just 
4 GB of main memory demonstrate the speed and versatility of 
the new RoSA structure. For this data the RoSA's disk index
is around one third of the size of the previous LOF-SA
two-level suffix-array mechanism [2], [3], and the small footprint
of the in-memory part of the index (as little as 1% of the
size of the input text) means that queries are processed more
quickly than is possible using an FM-INDEX. That is, while
the FM-INDEX [4], [5] is a much more compact structure,
all of it must be memory-resident during query processing,
hindering its ability to search very large texts.

B. Definitions 

Text T[0 ... n - 1] is assumed to consist of n symbols each
a member of an alphabet Σ = {a_0, a_1, a_2, ..., a_{σ-1}} of size
σ = |Σ|, augmented by a sentinel in T[n] that is smaller
than every element in Σ. The ith suffix of T is the sequence
T[i ... n], including the sentinel, and is denoted by T_i. The
longest common prefix LCP(T_i, T_j) of two suffixes of T is
the maximal value k such that T[i + ℓ] = T[j + ℓ] for all
0 ≤ ℓ < k. If T_i and T_j are suffixes of T, then T_i < T_j if
and only if T[i + k] < T[j + k], where k = LCP(T_i, T_j).
A pattern P[0 ... m - 1] matches T at i if P[0 ... m - 1] is
identical to T[i ... i + m - 1], that is, if P is a prefix of the
ith suffix of T.

Fig. 1. External common-prefix suffix blocks formed for T = "she#sells#shells$" with blocksize at most b.

Array SA[0 ... n] is a suffix array for text T if
T_SA[i] < T_SA[j] whenever i < j. In the context of a suffix array
it is then useful to define LCP[i] = LCP(T_SA[i-1], T_SA[i]),
with LCP[0] = -1. The Burrows-Wheeler transform (BWT),
denoted L, is also required in our development: L[i] contains
the preceding character of the ith sorted suffix, L[i] =
T[(SA[i] - 1) mod n]. Figure 1 shows an example string of
n = 16 characters that is used throughout the discussion, plus
its sorted suffixes. The column headed SA[i] is the value stored
in the ith entry in the suffix array for the string; and the
column headed L[i] is the corresponding BWT symbol, being
the character immediately prior to the ith sorted suffix. The
other parts of Figure 1 are described shortly.
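The definitions above can be checked on the example string with a short sketch. It is a naive construction intended only for small inputs, not the construction method used for the experiments. The function names are illustrative, the sentinel "$" is forced to sort below "#", and the wrap-around for the suffix that starts at position 0 is taken over the text including its sentinel, which is an assumption of this sketch.

```python
def suffix_array(text):
    """Naive suffix array of text, whose last symbol is the sentinel '$'."""
    # The sentinel must sort below every other symbol, including '#'.
    key = lambda c: -1 if c == "$" else ord(c)
    return sorted(range(len(text)), key=lambda i: [key(c) for c in text[i:]])


def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k


T = "she#sells#shells$"      # n = 16 symbols plus the sentinel in T[16]
SA = suffix_array(T)
LCP = [-1] + [lcp(T[SA[i - 1]:], T[SA[i]:]) for i in range(1, len(SA))]
# Symbol preceding each sorted suffix; T[-1] wraps around to the sentinel.
L = "".join(T[p - 1] for p in SA)

for i, p in enumerate(SA):
    print(f"{i:2d} {p:2d} {LCP[i]:2d} {L[i]} {T[p:]}")
```

The printed rows list i, SA[i], LCP[i], L[i], and the ith sorted suffix, and can be compared against the block boundaries discussed in the remainder of the paper.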

We also employ rank and select operations: for sequence X
operation rank(X, i, c) returns the number of occurrences of
symbol or sequence c in X[0 ... i - 1]; and select(X, i, c) returns
the position of the ith occurrence of c, counting from zero.
For example, if X[0 ... 15] = "she#sells#shells", then
rank(X, 8, "s") is 2, and select(X, 2, "e") is 12. Although
sophisticated mechanisms exist for implementing rank and
select that have good asymptotic properties, one of the most
useful practical approaches simply adds regular cumulative
sums to a standard bitvector representation, expanding it by
25% or by 6.25%, depending on the sampling interval [6], [7].
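The two operations, and the sampled cumulative-sum idea, can be illustrated as follows. This is a minimal sketch: the linear-time rank and select follow the definitions literally, and the class name SampledBitvector and its tiny sampling interval are assumptions chosen for readability rather than details of the cited implementations.

```python
def rank(X, i, c):
    """Number of occurrences of symbol c in X[0..i-1]."""
    return X[:i].count(c)


def select(X, i, c):
    """Position of the i-th occurrence of c in X, counting from zero."""
    pos = -1
    for _ in range(i + 1):
        pos = X.index(c, pos + 1)   # raises ValueError if there are too few
    return pos


class SampledBitvector:
    """Bitvector with a cumulative count of 1-bits stored every `interval` bits."""

    def __init__(self, bits, interval=4):
        self.bits = bits
        self.interval = interval
        self.samples = [0]
        for start in range(0, len(bits), interval):
            self.samples.append(self.samples[-1] + sum(bits[start:start + interval]))

    def rank1(self, i):
        """Number of 1-bits in bits[0..i-1]: one sample plus a short scan."""
        block = i // self.interval
        return self.samples[block] + sum(self.bits[block * self.interval:i])


X = "she#sells#shells"
print(rank(X, 8, "s"))     # 2, as in the text
print(select(X, 2, "e"))   # 12, as in the text
```

A larger sampling interval lowers the space overhead of the samples at the cost of a longer scan inside each rank1 call.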

II. Background: Suffix Tries, Trees, and Arrays 

A number of index structures can be used for string search 
over a static text T if it is assumed that T and its index can 
both be held in fast random-access memory. 

A. Suffix Trie 

A trie is a tree in which each node is implicitly labeled
with the concatenation of the edge labels on the path from the
root, and each of the as many as σ edges out of each node
is explicitly labeled with a single symbol from the alphabet
Σ. A suffix trie for a text T contains a leaf for each of the
n + 1 suffixes T_i, each of which stores the corresponding
index i. The storage required by a suffix trie is proportional to
the total number of edges in the trie, and might be as large as
O(n^2). The minimum space required by a suffix trie is at least
n log n bits,¹ since every location in T is indexed in the tree,
and involves an address in the range 0 ... n. If the set of child
pointers at each node is stored as a table indexed by edge label,
existence and count queries for a pattern P of length m can
be processed in O(m) time, and locate queries in O(m + k)
time, where k is the number of matching positions.

B. Suffix Tree 

A suffix tree for text T is a modified suffix trie in which
the parent-child edges represent sequences of symbols from Σ
rather than single symbols; and in which internal nodes that
only have a single child are eliminated. The edge labels are
stored as references to T rather than as explicit sequences of
symbols, and the per-edge space requirement increases from
O(log σ) bits to O(log n) bits. But the number of edges is
bounded, and a suffix tree for T has n leaves and at most n
internal nodes, and occupies at most O(n log n) bits in total,
with typical implementations requiring 3n or more log n-bit
pointers. Searching follows the same process as in a suffix
trie, but involves an access to T as each edge is traversed, in
order to match symbols in the pattern.

C. Blind Tree 

The suffix tree's accesses to the text T are not localized, and
are relatively costly. In a blind tree [8], [9], [10] the outgoing
edges at each node are represented by just the first symbol
of the corresponding sequence, rather than by pointers to T.
The remaining (if any) symbols that label that edge in the
corresponding suffix tree are not stored. Instead, internal nodes
store the LCP of the set of strings represented at that node,
and during querying, when a node is reached, the search steps
forward to the indicated symbol, bypassing any omitted labels.

¹ We assume throughout that logarithms are binary, and that log x should
be taken to mean ⌈log_2 x⌉ when appropriate.

Fig. 2. Bit-blind tree for the ASCII strings "$", "#", "e", "h", "ll", "ls", "s$", "s#", "se", and "sh", being the identifying block prefixes of the ten
suffix array blocks identified in Figure 1 when the example string is processed with b = 3. The three different types of leaf nodes, and the meaning of the
dotted lines, are discussed in Section IV. The ASCII codes for the characters in question are shown at the top-right.

Search in a blind tree follows a similar path as in an 
equivalent suffix tree. At any given node, at most one edge 
can match the next unexamined symbol in the pattern P, and 
if such an edge exists, the search proceeds to the indicated 
child. The risk in following edges that are labeled by just a 
single symbol is that the other symbols that are bypassed may 
not match between P and T. To address that risk, once either 
the pattern has been exhausted, or a leaf has been reached, the 
full pattern is rechecked against the location in T indicated by 
any leaf in the subtree rooted at that node, to examine the 
bypassed symbols. By proceeding with the search based on 
only partial matches, not only is there a saving in space, but 
also the majority of the accesses to T are eliminated. Instead, 
a sequential examination of symbols at a single candidate 
location of T is undertaken, to either verify that a match has 
been correctly identified, or to confirm that there cannot be 
any occurrences of P in T. 

D. Bit-Blind Tree 

A concise form of blind tree has been developed [8] which,
for clarity, we refer to here as a bit-blind tree. Rather than
character LCP values and character edge labels, bit-based
LCP values and binary edge labels are employed. Moreover,
because internal nodes have exactly two children, the edge
labels do not need to be stored. The tree becomes deeper by
a factor of as much as log σ; on the other hand, it takes less
space. In total, the cost of a bit-blind tree storing the n suffixes
of a text T is n - 1 internal nodes, each containing a bit-LCP
value and two pointers (or equivalent); and n leaves, each
containing a log n-bit suffix pointer.

Figure 2 shows the bit-blind tree for the set of blocks
identified in the right-hand side of Figure 1. The reason that
these particular strings are of interest, and only a partial
tree is stored, is discussed shortly. The ten strings are each
represented by one of the leaves of the tree; the categorization
of those leaves into three types is also described below.

The bitvector bv at the bottom of Figure 2 describes the
structure of the bit-blind tree, and eliminates the need for
explicit pointers at the internal nodes. To create bv the nodes
of the tree are labeled in row-level order, and a "1" bit is stored
for nodes with (a pair of) children, and a "0" bit is stored if
not. The "1" bits exactly correspond to the locations at which
relative LCP values are required; conversely, the "0" bits
exactly correspond to the locations at which block pointers are
required. The required tree navigation operations on internal
nodes (that is, node identifiers x such that bv[x] = 1) are then
provided via rank and select operations, as follows:

• lchild(x) ← 2 × rank(bv, x, 1) + 1

• rchild(x) ← 2 × rank(bv, x, 1) + 2

• LCP[x] ← LCP[parent(x)] + LCPdata[rank(bv, x, 1)]

where LCPdata is a dense array of bit-LCP differentials, as
shown at the bottom of the diagram, and LCP[parent(x)] will
have been computed during the previous iteration of the tree
traversal loop. Details of the three types of leaf node, and
of the meaning of the SAdata and size fields, are given in
Section IV.
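The navigation rules can be exercised on a toy tree. The five-node bitvector, the LCPdata values, and the function names below are hypothetical and are not the tree of Figure 2; the sketch assumes the root's parent contributes a bit-LCP of zero, and that the pattern bit at the accumulated bit-LCP position selects the left (0) or right (1) child.

```python
def rank1(bv, x):
    """Number of 1-bits in bv[0..x-1] (linear scan, for illustration only)."""
    return sum(bv[:x])


def lchild(bv, x):
    return 2 * rank1(bv, x) + 1


def rchild(bv, x):
    return 2 * rank1(bv, x) + 2


def leaf_index(bv, x):
    """Index into the dense per-leaf arrays for a leaf identifier x."""
    return x - rank1(bv, x)


def descend(bv, lcp_data, pattern_bits):
    """Blind descent from the root; returns the node identifier reached."""
    x, bit_lcp = 0, 0
    while bv[x] == 1:
        bit_lcp += lcp_data[rank1(bv, x)]     # LCP[x] = LCP[parent] + differential
        if bit_lcp >= len(pattern_bits):
            break                             # pattern exhausted at an internal node
        x = rchild(bv, x) if pattern_bits[bit_lcp] == "1" else lchild(bv, x)
    return x


# Level order: root, left leaf, right internal node, and its two leaves.
bv = [1, 0, 1, 0, 0]
lcp_data = [1, 2]          # hypothetical bit-LCP differentials for nodes 0 and 2

node = descend(bv, lcp_data, "0110")
print(node, leaf_index(bv, node))
```

Because the descent is blind, the leaf that is reached still has to be verified against the text before a match can be reported.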
E. Suffix Array

As has already been noted, occurrences in T of a pattern
P can be identified using a binary search in suffix array SA
using O(log n + m) character comparisons [1]. In addition, if
an LCP array is provided, the set of all matching locations of
P in T can be identified in O(1) time each once the first one
has been identified. The suffix array is more compact than any
of the suffix trie or suffix tree-based alternatives, including the
bit-blind tree, and is typically represented as a single log n-bit
value for each suffix of T.
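As an illustration, locate queries over the example string can be answered with two plain binary searches. This sketch performs O(m log n) character comparisons rather than the O(log n + m) bound quoted above, which requires additional LCP information; the function names are illustrative and the naive suffix sort is only suitable for small inputs.

```python
def sort_key(s):
    """Symbol ranks with the sentinel '$' forced below every other symbol."""
    return [-1 if c == "$" else ord(c) for c in s]


def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: sort_key(text[i:]))


def locate(T, SA, P):
    """Sorted text positions at which P occurs, via binary search over SA."""
    m, target = len(P), sort_key(P)
    lo, hi = 0, len(SA)
    while lo < hi:                      # first suffix whose m-prefix is >= P
        mid = (lo + hi) // 2
        if sort_key(T[SA[mid]:SA[mid] + m]) < target:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(SA)
    while lo < hi:                      # first suffix whose m-prefix is > P
        mid = (lo + hi) // 2
        if sort_key(T[SA[mid]:SA[mid] + m]) <= target:
            lo = mid + 1
        else:
            hi = mid
    return sorted(SA[first:lo])


T = "she#sells#shells$"
SA = suffix_array(T)
print(locate(T, SA, "sh"), locate(T, SA, "ell"))
```

The width of the located interval answers a count query directly, without listing the individual positions.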

Mäkinen and Navarro [11] note that runs in the BWT
string L can be used to identify suffix pointer indirections
that allow space to be saved. González and Navarro [12]
extended this work, recognizing repeated patterns of suffix
pointer differences using the Re-Pair compression technique.
But note that when T is small enough that it fits into available
memory, the FM-INDEX, described next, is the most attractive
option. That is, reducing the size of an in-memory suffix
array does not necessarily lead to performance improvements.
In Section IV we apply similar techniques to disk-based
suffix arrays, where the space reduction achieved does make
a difference.



F. FM-Index 

The last decade has seen considerable development in the
area of compressed self indexing. Hon et al. [13] survey much
of this work; perhaps the best exemplar of the category is
the FM-INDEX of Ferragina and Manzini [4], [5]. Based
around the Burrows-Wheeler transform, the FM-INDEX has
a highly desirable blend of properties: it allows pattern
search in O(m log σ) time; it requires space proportional to
nH_k(T) + σ^k, the information content of the original text;²
and it allows reconstruction of the text in entirety from the
beginning, and from (with additional storage cost) sampled
re-entry points.

For texts for which the FM-INDEX fits into random access
memory, existence and count queries are fast; while the speed
of locate queries depends on the sampling rate for decoding,
and allows a tradeoff between space and speed. We include
experimental results for the FM-INDEX in Section VI based
on a new implementation developed as part of a recent
investigation into compressed bitvector representations [6].

The FM-INDEX is less efficient when the compressed
representation of T is too large for main memory: the
non-sequential access pattern dictated by the BWT sequence
makes the FM-INDEX a poor choice for disk-based search.
A particular disadvantage of the backward search used in the
FM-INDEX is that the range of the search interval is
non-increasing, but the upper and lower bounds on that interval
are not convergent. This arrangement means that even the best
external variants of compressed searching potentially make m
disk accesses [14], which is impractical for long patterns.

² That is, the number of bits required to store the text using an order-k
statistical context-based compression model, including an allowance for
storing the model parameters.



III. On-Disk Suffix Arrays 

Two approaches have emerged for storing suffix array
structures on secondary storage: methods that make use of
uniform-size blocks, so that every block except the last contains
exactly b pointers; and methods that make use of
variable-sized blocks, in which b is an upper bound on the blocksize,
and characteristics of the data are used to determine the block
boundaries, subject to that bound.

A. Uniform Blocks and the String B-Tree 

Baeza-Yates et al. [15] describe the SPAT, a structure in
which the suffix array is formed into uniform blocks each
containing b pointers, and the in-memory index is an array
of n/b fixed-length strings, being the first ℓ_s symbols of the
last suffix in each block. The AUGMENTED-SA proposal of
Colussi and De Col [16] also partitions the on-disk suffix array
into uniform blocks (each of b = log n suffix pointers) but
with the in-memory index constructed as a suffix tree to the
(full) first suffix string of the block. González and Navarro [14]
provide a summary of these early techniques.

Ferragina and Grossi [8] describe a dynamic string search
structure they call the String B-tree, or SB-Tree. For static
data of the type considered here, the SB-Tree is implemented
as a uniform partitioning of a suffix array, with an in-memory
suffix tree index implemented as a blind tree or bit-blind tree,
with each leaf containing a block of b suffix pointers to T.
More than one level of indexing can be used if necessary,
with all blocks having the same structure. Each node of the
SB-Tree indexes b strings via 2b bits describing the shape of
a binary tree of b leaves and b - 1 internal nodes; plus b - 1
internal node depths, expressed in bit offsets from the start
of the pattern P, each taking at most log(n' log σ) bits, where
n' < n is the longest character LCP value across the entire set
of strings; plus b suffix pointers each of log n bits. If all
non-leaf blocks are held in memory, the in-memory component of
a static SB-Tree contains s = ⌈n/b⌉ suffix pointers, s - 1
LCP values, and 2s bits describing the tree shape.

An advantage of the uniform-sized disk blocks used in the 
SB-Tree is that they allow node addresses to be calculated 
rather than stored, and no pointers are needed to navigate the 
index. The only pointers stored in the SB-Tree - in internal 
nodes as well as in leaf blocks - are to the text T rather than 
to disk blocks. Note also that suffix pointers are required in 
the internal nodes for a static SB-Tree only if blind
search-induced pattern ambiguity is to be resolved on a per-block
basis. If the pattern ambiguity is tolerated until the whole
of P has been handled, then a single holistic check can be
undertaken against T via a suffix pointer from a leaf node.
Regardless, as a minimum, a static SB-Tree stores n suffix
pointers in its leaf blocks, occupying n log n bits.

Taking these various considerations into account, the
minimum size for a static SB-Tree covering a text of n symbols
using a blocksize of b pointers is

$$ n \left( 2 + \log(n' \log \sigma) + \log n \right) \qquad (1) $$

bits, where n' < n is the length in characters of the largest
LCP value. That is, the SB-Tree index might add a space
overhead of as much as 100% to the n log n bits required by
a plain suffix array.
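The overhead implied by Equation (1) can be explored numerically. The following sketch evaluates the per-suffix cost for a text of 2^36 byte symbols; the values tried for the longest character LCP are hypothetical and are not measurements from the experiments, and the function name is illustrative.

```python
from math import ceil, log2


def sb_tree_bits_per_suffix(n, sigma, max_lcp):
    """Per-suffix cost in bits: 2 + log(max_lcp * log sigma) + log n."""
    shape = 2                                          # tree-shape bits
    depth = ceil(log2(max_lcp * ceil(log2(sigma))))    # one bit-LCP value
    pointer = ceil(log2(n))                            # one suffix pointer
    return shape + depth + pointer


n, sigma = 2 ** 36, 256            # 64 GB of byte symbols
for max_lcp in (2 ** 10, 2 ** 20, 2 ** 30):            # hypothetical LCP bounds
    bits = sb_tree_bits_per_suffix(n, sigma, max_lcp)
    overhead = bits / ceil(log2(n)) - 1
    print(f"max LCP 2^{int(log2(max_lcp))}: {bits} bits per suffix, "
          f"{overhead:.0%} over a plain suffix array")
```

The overhead approaches 100% only as the longest LCP approaches n; for texts with shorter repeats the depth term is correspondingly smaller.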

B. Variable Blocks and the LOF-SA 

Sinha et al. [2] describe the LOF-SA, a two-level index
structure in which the block control parameter b is an upper
bound, and suffix array blocks correspond to subtrees in the
suffix tree. If v is a node in the suffix tree for text T, and
if size(parent(v)) > b and size(v) ≤ b, then a suffix array
block is formed corresponding to node v. All elements in the
block share the prefix associated with v. The divisions shown
in Figure 1 denote the ten blocks that result when the example
string is processed allowing at most b = 3 suffix pointers in
each suffix block; and Figure 2 shows how those ten block
prefixes are stored in a bit-blind tree.
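The block formation rule can be reproduced on the example string by splitting suffix array intervals one symbol at a time until each holds at most b suffixes. This is a sketch under the assumption that each block is identified by its shortest distinguishing prefix, as in the ten strings listed for Figure 2; the function names are illustrative and the naive suffix sort is only suitable for small inputs.

```python
def suffix_array(text):
    key = lambda c: -1 if c == "$" else ord(c)     # sentinel sorts first
    return sorted(range(len(text)), key=lambda i: [key(c) for c in text[i:]])


def common_prefix_blocks(T, SA, b):
    """Return (prefix, lo, hi) triples, hi exclusive, covering SA in order."""
    blocks = []

    def split(lo, hi, depth):
        if hi - lo <= b:                           # small enough: emit a block
            blocks.append((T[SA[lo]:SA[lo] + depth], lo, hi))
            return
        start = lo
        while start < hi:                          # group by the next symbol
            c = T[SA[start] + depth]
            end = start
            while end < hi and T[SA[end] + depth] == c:
                end += 1
            split(start, end, depth + 1)
            start = end

    split(0, len(SA), 0)
    return blocks


T = "she#sells#shells$"
SA = suffix_array(T)
blocks = common_prefix_blocks(T, SA, b=3)
for prefix, lo, hi in blocks:
    print(f"{prefix!r:6} SA[{lo}..{hi - 1}] = {SA[lo:hi]}")
```

With b = 3 the sketch yields the ten prefixes "$", "#", "e", "h", "ll", "ls", "s$", "s#", "se", and "sh".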

Sinha et al. use a trie for the in-memory component of the
LOF-SA, but this has the disadvantage of a quadratic
worst-case space requirement. A bit-blind trie, and the structure we
present in Section V, both require less space in both the
average case and the worst case.

Pattern search using the LOF-SA steps through the symbols 
in P, navigating the in-memory search structure, either until 
the pattern is exhausted, in which case all children of the node 
that was reached are answers to the query; or until a leaf in 
the trie is reached, in which case the answers, if any exist, are 
confined to a single block of the on-disk suffix array. In the 
latter case that block is fetched and searched. 

Regardless of how the internal structure is organized, the
variable sized disk blocks mean that a disk address of log n
bits must be stored at each in-memory leaf. In the on-disk
blocks, Sinha et al. also store an LCP value for each suffix;
plus, as was previously sketched by Colussi and De Col [16],
a small number f of extension symbols (the fringe) to help
minimize search ambiguities. Search within a LOF-SA suffix
block is sequential, capitalizing on the LCP and fringe values.
Accesses are made to T only if there are gaps in the fringe
that result in pattern uncertainty. Inclusion of the fringe for
each suffix increases the size of disk blocks, and each entry
in each on-disk suffix block contains an LCP value, a pointer
into T, and a set of fringe symbols.

Sinha et al. undertook a range of experiments with 2 GB
of DNA and 471 MB of English text, and patterns of length
6 to 1,000. With a blocksize bound of b = 4,096 and a fringe
length of f = 4 characters, the in-memory component and
on-disk component for the 471 MB English text file required
21 MB and 5.5 GB respectively, and yielded searching times
around half or less of the SPAT, and around 8 times faster than
a pure suffix array. Moffat et al. [3] considered compression of
the on-disk components, and showed that the space required
by the on-disk data can be reduced by approximately 40%,
from 12n bytes down to around 7.1n bytes.

The next two sections describe our enhancements to the
LOF-SA. First, in Section IV we show that as many as half
of the suffix pointers can be elided, via a process we call
block reduction. Then, in Section V we introduce a condensed
BWT in-memory index structure that provides a unique mix
of attributes and allows fast searching over a set of strings.



IV. Reducible Blocks 

This section considers the suffix array reduction process of
Mäkinen and Navarro [11] and shows that it can be applied
to variable-size on-disk suffix blocks.



A. Identifying Reductions 

A whole-block reduction is possible exactly when all of
the BWT symbols corresponding to the suffixes contained in
a block are the same. For example, in Figure 1 the suffixes
corresponding to the prefix "h", with pointers SA[6] = 1 and
SA[7] = 11, form a block when b = 3; and both have an "s" in
the column headed L[i]. Hence, a reduction to the suffix "sh"
is possible. Examination of the set of b = 3 blocks shown in
Figure 1 reveals that the suffixes at offsets 8-9 for "ll" can
be reduced to a subset of the block of suffixes at offsets 3-5
for "e"; and that, via two such steps, the suffixes at offsets
10-11 for "ls" can be reduced to the same underlying block.
The three arrows at the left of Figure 1 show the full set of
block relationships that exist in the example string, with the
three reducible blocks lightly shaded; the same reductions are
also noted with the dotted arrows in Figure 2.

B. Singleton Blocks 

The variable block approach also sometimes generates blocks
with just one pointer in them; we call these singleton blocks.
They are unshaded in Figures 1 and 2 and represent another
opportunity for space savings, since the corresponding suffix
pointers can be stored directly in the in-memory index, rather
than placed in a suffix block on disk. In the example string
there are four singleton blocks. Only non-singleton irreducible
blocks need to be placed onto disk; as can be seen in the
example, there are three such blocks, and they contain a total
of only seven suffix pointers.
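The classification of the ten example blocks into singleton, reducible, and irreducible can be checked mechanically. In this sketch the block boundaries are entered by hand as inclusive suffix array intervals, the BWT is computed with a wrap-around over the text including its sentinel (an assumption of the sketch), and the names classify, kinds, and on_disk_pointers are illustrative.

```python
T = "she#sells#shells$"
key = lambda c: -1 if c == "$" else ord(c)          # sentinel sorts first
SA = sorted(range(len(T)), key=lambda i: [key(c) for c in T[i:]])
L = [T[p - 1] for p in SA]                          # BWT symbol per sorted suffix

# Inclusive SA intervals of the ten blocks formed with b = 3.
blocks = {"$": (0, 0), "#": (1, 2), "e": (3, 5), "h": (6, 7), "ll": (8, 9),
          "ls": (10, 11), "s$": (12, 12), "s#": (13, 13), "se": (14, 14),
          "sh": (15, 16)}


def classify(lo, hi):
    if lo == hi:
        return "singleton"                          # pointer kept in memory
    same = len(set(L[lo:hi + 1])) == 1              # all BWT symbols equal?
    return "reducible" if same else "irreducible"


kinds = {prefix: classify(lo, hi) for prefix, (lo, hi) in blocks.items()}
on_disk_pointers = sum(hi - lo + 1 for prefix, (lo, hi) in blocks.items()
                       if kinds[prefix] == "irreducible")

for prefix, kind in kinds.items():
    print(f"{prefix!r:6} {kind}")
print("suffix pointers stored on disk:", on_disk_pointers)
```

For the example this leaves the blocks for "#", "e", and "sh" on disk, holding seven suffix pointers between them.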

C. Storing Information About Reductions 

The details of each block reduction are held as a (Δ_x, Δ_d)
pair relative to an irreducible block, where Δ_x is the offset
from the start of the irreducible block at which the reduced
block commences, and Δ_d is the offset to be applied to each
suffix pointer. The three reducible blocks in Figure 1 are
annotated with their offset pairs.
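One way to derive the offset pairs for the example string is to step every pointer of a reducible block back by one text position until an irreducible host block is reached. This is a sketch under stated assumptions rather than the construction used in the implementation: block boundaries are entered by hand, the names resolve, block_of, and is_reducible are illustrative, and the pairs it produces follow the definitions in the text but have not been checked against the annotations in Figure 1.

```python
T = "she#sells#shells$"
key = lambda c: -1 if c == "$" else ord(c)          # sentinel sorts first
SA = sorted(range(len(T)), key=lambda i: [key(c) for c in T[i:]])
ISA = {p: i for i, p in enumerate(SA)}              # text position -> SA index
L = [T[p - 1] for p in SA]

# Inclusive SA intervals of the ten blocks formed with b = 3.
blocks = [(0, 0), (1, 2), (3, 5), (6, 7), (8, 9), (10, 11),
          (12, 12), (13, 13), (14, 14), (15, 16)]


def block_of(i):
    return next((lo, hi) for lo, hi in blocks if lo <= i <= hi)


def is_reducible(lo, hi):
    return hi > lo and len(set(L[lo:hi + 1])) == 1


def resolve(lo, hi):
    """Return (host_block, delta_x, delta_d) for the block SA[lo..hi]."""
    host, delta_d = (lo, hi), 0
    while is_reducible(*host):
        # All suffixes share one preceding symbol, so the predecessors
        # occupy a contiguous SA range in the same relative order.
        lo, hi = ISA[SA[lo] - 1], ISA[SA[hi] - 1]
        delta_d += 1
        host = block_of(lo)
    return host, lo - host[0], delta_d


for lo, hi in blocks:
    host, dx, dd = resolve(lo, hi)
    rebuilt = [SA[host[0] + dx + j] + dd for j in range(hi - lo + 1)]
    print((lo, hi), "->", host, (dx, dd), rebuilt == SA[lo:hi + 1])
```

Irreducible and singleton blocks resolve to themselves with the pair (0, 0), and every reduced block's pointers are recovered by adding Δ_d to the host pointers starting at offset Δ_x.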

To save memory space, each leaf of the in-memory index
stores only a block number, and all other information is stored
as part of the disk blocks. Each suffix block contains a small
header table of Δ_x and Δ_d values, one pair per suffix block
(reducible or irreducible) that is hosted within that set of suffix
pointers. This table maps information accumulated during
the in-memory search (position reached in the pattern, and
current suffix interval width) to (Δ_x, Δ_d) pairs that are used
to continue the search within the block. The in-memory part
does not differentiate between reducible and irreducible blocks
at all: the latter correspond to Δ_x = 0 and Δ_d = 0.

The in-memory structure identifies singletons by virtue of
the fact that the search interval is one. Singletons are also
reducible, by definition, but a search-time disk access can be
saved if they point directly to T rather than via a suffix block.
In Figure 2 non-singleton pointers are marked with a "b", but
no such differentiation is required in practice, since
singleton-block suffix pointers exactly correspond to situations where
the size field (shown at the bottom of Figure 2) is 1.

D. Storing the On-Disk Suffix Array 

Each suffix block contains a table of (Δ_x, Δ_d) offsets, plus
a set of suffix pointers, plus a set of differential (relative to
the parent) LCP values, plus two bits per leaf to indicate the
tree structure, plus a small fixed overhead on the latter to allow
rank operations. One key advantage of the LOF-SA
variable-block arrangement is that each block can store the LCP values
(shown as LCPdata in Figure 2) in compressed form, since
there is no requirement that all disk blocks be the same size.
This difference is significant in terms of space utilization.

In our implementation the LCPs are stored as differences
relative to their parent in the suffix tree, and coded using the
Elias δ code [17] with cumulative-sum samples inserted every
64 values to allow pseudo-random access to be carried out. The
node sizes are similarly stored cumulatively, so that the size 
of any node can be extracted by subtracting the cumulative 
count of its leftmost child from the cumulative count of its 
rightmost child. Suffix pointers are stored as minimal-width 
binary values, but are not otherwise compressed. We also 
experimented with an alternative approach, in which LCP 
values were stored without being differenced relative to their 
parents, and the tree structure was created from the LCP values 
rather than via the bitvector bv. This option turned out to 
be both larger in size and slower in operation, and was not 
pursued beyond preliminary experimentation. 
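For reference, the Elias δ code can be sketched as follows. The bit-string representation, the function names, and the shift by one (so that a differential of zero can be coded) are assumptions of this illustration; the sampled cumulative sums that support pseudo-random access are omitted.

```python
def delta_encode(x):
    """Elias delta codeword for an integer x >= 1, as a string of bits."""
    if x < 1:
        raise ValueError("Elias delta codes are defined for x >= 1")
    n = x.bit_length() - 1              # floor(log2 x)
    ln = (n + 1).bit_length() - 1       # floor(log2 (n + 1))
    # ln zeros, then n + 1 in binary, then x without its leading 1-bit.
    return "0" * ln + format(n + 1, "b") + format(x, "b")[1:]


def delta_decode(bits, pos=0):
    """Decode one codeword starting at bits[pos]; return (value, next_pos)."""
    ln = 0
    while bits[pos + ln] == "0":
        ln += 1
    pos += ln
    n = int(bits[pos:pos + ln + 1], 2) - 1
    pos += ln + 1
    value = int("1" + bits[pos:pos + n], 2)
    return value, pos + n


# Hypothetical LCP differentials, shifted by one so that zero is codable.
diffs = [3, 0, 2, 1, 0, 17]
stream = "".join(delta_encode(d + 1) for d in diffs)

decoded, pos = [], 0
while pos < len(stream):
    value, pos = delta_decode(stream, pos)
    decoded.append(value - 1)
print(stream, decoded)
```

Small values receive short codewords, which suits differentials that are concentrated near zero.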

V. Indexing Using a Condensed BWT 

Having devised a mechanism for efficiently determining 
and storing block reductions, we now return to the issue of 
how to provide an efficient representation of the in-memory 
index, and introduce a condensed BWT index that provides the 
ability to resolve existence and count queries for frequently 
appearing patterns (patterns that occur more than b times 
in T) without any disk blocks needing to be retrieved. The 
critical observation that makes our approach possible is that 
reversing each of the strings stored in the in-memory index 
allows backward search within them to match a prefix of the 
pattern. Compared to the bit-blind tree, the new approach has 
the advantage of being comprehensive, in that the symbols in 
the pattern are checked exhaustively. 

A. Indexing the Blocks 

The LOF-SA employs a suffix trie (Section II-A) to store
the set of block prefix strings, but requires quadratic space
in the worst-case. A second option is to use a bit-blind tree
(Section II-C). Figure 2 shows a tree storing the block prefix
strings for the example text. Each of the ten leaves corresponds
to one of the blocks shown in Figure 1; only the irreducible
blocks, shown with dark shading, need to be stored on disk.

When bv[x] = 0 and x is the identifier of a leaf, the quantity
SAdata[x - rank(bv, x, 1)] indicates where corresponding
suffix pointer(s) are located, with SAdata another dense array,
containing either suffix array pointers, or suffix block disk
addresses (indicated in the example by a "b" prefix). The size
array also allows count queries to be handled efficiently.

In total, if there are K suffix array blocks, the structure
shown in Figure 2 requires storage of: 2K bits for the tree
structure; K - 1 bit-LCP differentials, each of which is less
than n log σ; K suffix or disk pointers, each of which is less
than n; and K block sizes, each of which is less than b. In
the worst case, processing of a pattern P of length m requires
navigation of the tree from the root to a leaf, and involves
m log σ bit-extraction operations and the same number of rank
operations, and takes O(m log σ) time.

B. Backward Search in a Forward BWT 

Ferragina and Manzini [4] show that pattern matching can
be realized via the BWT string L. Suppose that a suffix
ω = P[m - i ... m - 1] of length i has been matched, and
that the corresponding SA-interval is [lb_i ... rb_i]. We denote
this configuration with the notation (ω, i)[lb_i ... rb_i]. At the
beginning of the search, (ε, 0)[0 ... n - 1] is established. The new
SA-interval [lb_{i+1} ... rb_{i+1}] for ω' = cω with c = P[m - i - 1]
is contained within the section of SA corresponding to strings
that commence with c. The offset from the start of that range is
computed by counting the number of length-i substrings which
are both lexicographically smaller than ω and preceded by c.
Hence, (cω, i + 1)[C[c] + rank(L, lb_i, c) ... C[c] + rank(L, rb_i + 1, c) - 1]
is the next configuration of the backward search,
where C is a σ-element array that stores in C[c] the location
in SA of the first suffix commencing with symbol c, and can
be computed when L is constructed.
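The update rule can be run directly on the example string. This is a minimal sketch: rank is a linear scan rather than a wavelet tree, the initial interval covers all n + 1 suffixes including the one that begins at the sentinel, the BWT wrap-around is taken over the text including its sentinel, and the function names are illustrative.

```python
def build_index(T):
    """Return (SA, L, C) for a text T that ends with the sentinel '$'."""
    key = lambda c: -1 if c == "$" else ord(c)      # sentinel sorts first
    SA = sorted(range(len(T)), key=lambda i: [key(c) for c in T[i:]])
    L = "".join(T[p - 1] for p in SA)
    C, total = {}, 0
    for c in sorted(set(T), key=key):               # C[c]: first suffix starting with c
        C[c] = total
        total += T.count(c)
    return SA, L, C


def backward_search(P, L, C):
    """Return the inclusive SA-interval (lb, rb) for P, or None if P is absent."""
    lb, rb = 0, len(L) - 1
    for c in reversed(P):                           # consume P right to left
        if c not in C:
            return None
        lb = C[c] + L[:lb].count(c)                 # C[c] + rank(L, lb, c)
        rb = C[c] + L[:rb + 1].count(c) - 1         # C[c] + rank(L, rb + 1, c) - 1
        if lb > rb:
            return None
    return lb, rb


def count(P, L, C):
    interval = backward_search(P, L, C)
    return 0 if interval is None else interval[1] - interval[0] + 1


T = "she#sells#shells$"
SA, L, C = build_index(T)
print(backward_search("ell", L, C), count("she", L, C))
```

Only L and C are consulted during the search; the suffix array itself is needed only when an interval is converted into text positions for a locate query.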

The best approach for rank on general sequences over a
non-binary alphabet is to use a wavelet tree [18] or variant
thereof, which reduces each operation to at most log σ
operations over binary sequences. Here we use a
Huffman-shaped tree using compressed bitvectors [19], which represents
a sequence of symbols in its H_0 self-entropy. As already noted,
on a binary alphabet, rank and select can be carried out in
constant time by adding a fixed overhead (25% or 6.25%) on
top of the original bitvector [6], [7].

C. Backward Search in a Condensed Backward BWT 

A backward search in a reversed text is equivalent to a
forward search in a forward text. Figure 3 shows the reversed
example text in sorted suffix order, with a number of divisions
marked on the right-hand side. The column headed L^T[i]
shows the full BWT of the reversed text; but for our purposes
only a subset of the BWT is required, shown in the example as
CL = "s#lelshe$#". To allow positions in the condensed
BWT to be mapped to their positions in L^T, the bitvector bf
is used, with bf[i] = 1 when the predecessor symbol of the ith
suffix is in CL. Similarly, bitvector bl[i] = 1 if the ith entry of
L^T appears in CL. The run-length compressed FM-INDEX of
Mäkinen and Navarro [20] makes use of auxiliary bitvectors
in a similar manner to what we are about to describe.

Consider the suffix strings on the right-hand side of Figure 1.
The block-prefixes (shown by the shading) that need to
be reversed and indexed are "$", "#", "e", "h", "ll", "ls",
"s$", "s#", "se", and "sh". When reversed, they become
"$", "#", "e", "h", "ll", "sl", "$s", "#s", "es", and "hs";
if those reversed strings were then formed into a suffix trie,
nodes would be created for all of "$", "$s", "#", "#s", "e",
"es", "h", "hs", "l", "ll", "s", and "sl". To create the
bitvector bf that indicates which of the BWT characters are
needed in the condensed BWT, the interval [lb, rb] associated
with each of these nominal suffix trie nodes is located in the
reversed BWT, and the bits bf[lb] and bf[rb + 1] are set to 1, to
mark the beginning and end of each reversed search interval.
Any locations in bf with 1-bits at the end of this stage have
their corresponding first suffix character located in L^T and
copied into CL; and an inverse mapping bl is computed that
stores the locations extracted. For example, in Figure 3 the
first and fourth suffixes commencing with "s" are tagged in
bf; those "s" symbols occur in positions L^T[0] and L^T[10],
and so both bl[0] and bl[10] are set to 1, and two "s" symbols
appear in CL. Finally, a set of condensed symbol counts CC is
formed from the condensed BWT string CL.

Fig. 3. Full BWT text L^T, condensed BWT text CL, and indexing bitvectors bf and bl for the reversed text T^r = "sllehs#slles#ehs$".

Figure 4 details the backward search for a pattern P using
the condensed BWT CL and corresponding counts CC. As for
regular backward search, an interval is maintained, initially
(ε, 0)[0..n − 1]. That interval is then narrowed using the
condensed arrays, adding one more character into the matched
string at each iteration of the loop. The search commences with
the rightmost symbol in the reverse of P, which is the leftmost
symbol in P; and (in the frame of reference established in
Figure 3) prepends subsequent matched characters to the left.
In particular, the search process maintains

lb = min{k | T[SA[k]..SA[k] + d − 1] = P[0..d − 1]}

as the first suffix in SA that matches P to depth d, and

rb = max{k | T[SA[k]..SA[k] + d − 1] = P[0..d − 1]}

as the last such suffix.



00 get_interval(P, m)
01   d ← 0; lb ← 0; rb ← n − 1
02   while d < m and rb − lb + 1 > b do
03     c ← P[d]
04     (lb', rb') ← (rank(bl, lb, "1"), rank(bl, rb + 1, "1"))
05     (lb'', rb'') ← (rank(CL, lb', c), rank(CL, rb', c))
06     if lb'' = rb'' then
07       return not_found
08     lb ← select(bf, CC[c] + lb'', "1")
09     rb ← select(bf, CC[c] + rb'', "1") − 1
10     d ← d + 1
11   return (P[0..d − 1], d)[lb..rb]

Fig. 4. Backward search using a condensed BWT text CL and a condensed
count array CC.

To step from one configuration to the next, symbol P[d]
must be processed, with lb and rb updated so that the
assignment d ← d + 1 then restores the invariant. To narrow
the (lb, rb) interval the process described by Ferragina and
Manzini [5] is used, but with an added level of complexity: lb
and rb are first translated into the condensed domain, then
processed against the condensed BWT CL in that domain,
and finally translated back to the full domain, ready for
the next iteration. Those transformations are guided by the two
bitvectors bf and bl (see the example in Figure 3), which
record, respectively, which of the suffixes are needed during
the search in the condensed BWT, and where the lead symbols
of those suffixes appear in the full BWT string. Rank and
select operations on those two bitvectors yield the operation
sequence shown in Figure 4.
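
The construction and search steps can be traced with a small Python sketch over the running example T^r = "sllehs#slles#ehs$" with b = 3. It is an illustration under stated assumptions rather than the implementation used in the experiments: the symbol order "$" < "#" < letters is taken from the example, sorting is naive, the bitvectors are plain Python lists with linear-time rank and select, and the helper names build_condensed, rank1 and select1 are hypothetical.

```python
ORDER = "$#ehls"      # symbol order assumed for the running example


def build_condensed(R, b, order=ORDER):
    """Build bf, bl, CL and CC for reversed text R and blocksize limit b."""
    n, key = len(R), {c: i for i, c in enumerate(order)}
    # Naive sort of cyclic rotations; R must end in a unique smallest symbol.
    sa = sorted(range(n), key=lambda i: [key[c] for c in R[i:] + R[:i]])
    F = [R[i] for i in sa]            # first symbol of each sorted suffix
    L = [R[i - 1] for i in sa]        # full BWT of R
    C = {c: sum(key[x] < key[c] for x in R) for c in order}
    bf = [0] * (n + 1)
    todo = [(0, n - 1)]               # intervals holding more than b suffixes
    while todo:
        lb, rb = todo.pop()
        for c in order:               # one full backward-search step per symbol
            nlb = C[c] + L[:lb].count(c)
            nrb = C[c] + L[:rb + 1].count(c) - 1
            if nlb <= nrb:
                bf[nlb] = bf[nrb + 1] = 1          # mark interval start and end
                if nrb - nlb + 1 > b:
                    todo.append((nlb, nrb))
    # Map each tagged suffix to the BWT position that holds its first symbol.
    where = {c: [j for j, x in enumerate(L) if x == c] for c in order}
    seen = {c: 0 for c in order}
    bl = [0] * n
    for i in range(n):
        if bf[i]:
            bl[where[F[i]][seen[F[i]]]] = 1
        seen[F[i]] += 1
    CL = "".join(L[j] for j in range(n) if bl[j])
    CC = {c: sum(key[x] < key[c] for x in CL) for c in order}
    return bf, bl, CL, CC


def rank1(bits, i):
    """Number of 1-bits in bits[0..i-1]."""
    return sum(bits[:i])


def select1(bits, k):
    """Position of the (k+1)-th 1-bit, counting k from zero."""
    return [pos for pos, bit in enumerate(bits) if bit][k]


def get_interval(P, n, b, bf, bl, CL, CC):
    d, lb, rb = 0, 0, n - 1
    while d < len(P) and rb - lb + 1 > b:
        c = P[d]
        lb1, rb1 = rank1(bl, lb), rank1(bl, rb + 1)        # step 04
        lb2, rb2 = CL[:lb1].count(c), CL[:rb1].count(c)    # step 05
        if lb2 == rb2:
            return None                                    # step 07
        lb = select1(bf, CC[c] + lb2)                      # step 08
        rb = select1(bf, CC[c] + rb2) - 1                  # step 09
        d += 1
    return d, lb, rb


R = "sllehs#slles#ehs$"
bf, bl, CL, CC = build_condensed(R, 3)
print(CL)                                              # s#lelshe$#
print(get_interval("she", len(R), 3, bf, bl, CL, CC))  # (2, 6, 7)
```

Under these assumptions the sketch reproduces the condensed string CL = "s#lelshe$#" quoted earlier, and the search for "she" stops at depth 2 with the interval [6..7].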

For example, to match P = "she", the first iteration
processes the "s", and the configuration becomes ("s", 1)[12..16].
Then a second iteration in which the "h" is processed results in
the configuration ("hs", 2)[6..7]. Now the interval is smaller
than b = 3, so the in-memory search is ended, and the
indicated suffix block (backward identifier 7, forward identifier
9) is fetched. A search for "shy" would also require that block
9 be accessed before the search could be declared a failure.
On the other hand, the pattern "say" generates the (condensed
domain equivalent of the) empty configuration ("as", 2)[3..2]
at step 05 after two iterations, and reports failure at step 07
without a suffix block being required.

00 get_bwd_id(lb, d)
01   run_nr ← rank(bf, lb, "1")
02   if run_nr = 0 then
03     return 0
04   run_pos ← select(bm, run_nr − 1, "1") + 1
05   x ← min_depth[rank(bm, run_pos, "10")]
06   return run_pos − run_nr + (d − x)

Fig. 5. Determining the block identifier matching a reverse search
configuration (ω, d)[lb..rb].

D. Computing Block Numbers 

Once a configuration (ω, d)[lb..rb] has been established by
get_interval(), the next step is to map it to a block number;
that is, identify the correct gray superscript value associated
with the black block identification circles in Figures 1 and 3.
Because multiple blocks might map to the same lb value but
with different depths d, a further bitvector bm is required,
containing a 0-bit for each block in the forward suffix array,
plus a 1-bit for each 1-bit in bf, corresponding to blocks in
the reversed suffix array. The bits are interleaved so that each
entry point in bm is preceded by a string of 0-bits that indicates
the number of disk blocks converging at that entry point. The
process of mapping via that structure, plus another array of
integers that records the minimum configuration depths at each
valid entry point, is described in Figure 5.

Once a block number in the reverse suffix array has been
identified, it is converted to an on-disk byte address via an array
storing a mapping that is many-to-one because of the reducible
blocks. The configuration (ω, d)[lb..rb] is then compared with
the (lb, d) values stored in the block's header, to identify the
matching (A a , A^) region or subregion of the block at which 
the search should be resumed. 

E. Space Requirement 

The bitvectors and arrays required in memory during querying
are summarized in Table I. The symbols extracted into the
condensed BWT are exactly those required during searching
for any of the block prefix strings. No BWT symbols that
would only be accessed if rb − lb was permitted to become
smaller than b are needed. At most two bits are required for
each node in the corresponding frequency-pruned suffix tree,
and that tree contains at most 2B nodes if the RoSA contains
B disk blocks. The maximum number of bits that can be set
is n, meaning that the actual number of bits set, z, is bounded
by z ≤ min{4B, n}. When b is large, B can be expected (but
not guaranteed) to be small, making the bitvectors bf and bl
sparse and highly compressible; and making the CL and CC
arrays that represent the condensed BWT small too.

F. Execution Time 

Function get_interval() in Figure 4 iterates at most once
for each character in the pattern. A total of two bitvector rank
operations and two bitvector select operations are required per
iteration; each of these takes O(1) time. Step 05 involves rank
operations on an array, CL. That array is implemented as a
Huffman-shaped wavelet tree, based on underlying bitvectors,
meaning that symbol-based rank queries can be carried out
via not more than log σ bitvector-based rank queries, or in
O(log σ) time. The process of finding the matching block
identifier (function get_bwd_id() in Figure 5) involves only
rank and select operations on bitvectors, and takes O(1) time
per pattern.

We now bring together these various observations, and state 
the main result of this section. 

THEOREM 1: Given a set of B strings corresponding to the
leaves of a pruned suffix tree for a text of n symbols, the
condensed BWT structure requires O((B + σ) log n) bits of
storage and identifies the leaf corresponding to an m-symbol
pattern in O(m log σ) time.

VI. Experiments 

We have implemented and tested our Reduced On-disk 
Suffix Array, or RoSA, and compared it against a range of 
alternatives. 

A. Experimental Hardware 

Experiments were run on two different hardware platforms: 
a MacBook Pro with a 2.4 GHz Intel Core i5 processor, 
4 GB RAM, and 500 GB hard disk; and a MacBook Air 
with 1.8 GHz Intel Core i7 processor, 4 GB RAM, and a 
250 GB solid-state disk. The suffix array itself was prepared 
on a separate server with a large amount of main memory. 

B. Test Data 

Data was obtained from a range of sources, with an
emphasis on large files. The first suite of test files was
drawn from the 2009 ClueWeb collection, a large-scale
web crawl. Three files were extracted as prefixes of the
concatenation of the first 64 files in the directory ClueWeb09/
disk1/ClueWeb09_English_1/enwp00/ with null bytes in the
text replaced by 0xFF bytes. (The null byte is the "$" symbol
reserved in all our implementations to mark the end of the
input string.) In Table II these three files are denoted as
Web-256, Web-4000, and Web-64000. Two other types of data
were also used: file DNA-3000 is a text file representing
the human genome stored as a sequence of ASCII letters
(primarily "A", "C", "G", and "T"); and file DBLP-1000 is an
XML repository containing 844,702 bibliographic references
to computing research papers.

http://lemurproject.org/clueweb09.php/

TABLE I

Structures required in memory during RoSA pattern matching. The value z is the number of entries in each of bf and bl. If there
are B suffix blocks, then z ≤ min{4B, n}. The final two columns show the actual cost for test file Web-64000, described in
Table II, and that number expressed as a multiple of B log n bits, with b = 4,096, and B = 219,319,568 blocks generated.

Structure   Type       Operations   Parameters                        Space (upper bound, bits)    Space (actual, MB)   × B log n
bf          bitvector  select       z elements, each 0 ≤ x < n        z(2 + log(n/z)) + o(z)       135.3                0.144
bl          bitvector  rank         z elements, each 0 ≤ x < n        z(2 + log(n/z)) + o(z)       135.3                0.144
bm          bitvector  rank/select  2B elements, each 0 ≤ x < n       2B(1 + log(n/B)) + o(B)      37.4                 0.040
min_depth   array      access       B elements, each 0 ≤ x < n − B    B log n                      72.3                 0.077
CC          array      access       σ integers, each 0 ≤ x < z        σ log n                      <0.1                 <0.001
CL          array      rank         z symbols, each 0 ≤ x < σ         O(z H_0(CL)) = O(z log σ)    74.1                 0.079
pointers    array      access       B elements, each 0 ≤ x < n        B log n                      967.4                1.023

TABLE II

Details of data files. The value of H_k is empirical, generated by executing xz --best.

Name        Type               Size (MB)   σ     H_k (bits/char)   LCP median   LCP average   LCP maximum, h
Web-256     HTML/Web           256         129   0.45              141          5,937         556,673
Web-4000    HTML/Web           4,000       129   0.57              281          11,506        692,160
Web-64000   HTML/Web           64,002      129   0.61              1,896        20,500        1,204,953
DNA-3000    Text/Genomic       2,985       9     1.65              16           554,171       29,999,999
DBLP-1000   XML/Bibliographic  1,032       99    0.90              36           45            1,353

The three different types of data differ markedly in the
extent to which they contain sequence repetitions. In the web
data the LCP values are particularly high, caused by reuse of
formatting text, and by duplicate documents. The median LCP
is much lower for the XML and DNA data; but note that the
file DNA-3000 contained a repeated subsequence of thirty
million characters. The three data types also differ in the size
of the alphabet used, and in compressibility. To estimate the
latter quantity, the column marked H_k shows the compression
achieved by a high-quality mechanism, expressed in terms of
bits per character relative to the original. The web and XML
data are highly compressible; the DNA file somewhat less so.



Fig. 6. Space and processing time of in-memory search using the condensed
BWT approach as a function of blocksize, for three different bitvector
representations. Data is for Web-256, averaged over 1,000 patterns with
|P| = 40 and k = 10,000 matches per pattern, and with the blocksize
varying between b = 2^8 and b = 2^16.



C. Test patterns 

To generate test queries, a suffix tree representation of each
file was processed sequentially, and a large set of (pattern,
frequency) pairs identified. These were then quantized by
both pattern length and by pattern frequency, with agreement
assumed in the second dimension if the actual frequency was
within 25% of one of a set of target frequencies. This approach
allowed a total of 25 different query sets to be formed for each
file, representing all combinations of |P| ∈ {4, 10, 20, 40, 100}
and pattern frequency k ∈ {10^0, 10^1, 10^2, 10^3, 10^4}. On the
web data, all combinations occurred more than 1,000 times,
and experiments were run on random subsets of size 1,000
drawn from the corresponding category. Selected combinations
of |P| and k were used for the other data files, and results are
similarly the average over 1,000 patterns. It was not possible
to identify any patterns with |P| = 4 and k = 10,000 on
DNA-3000, and as a result one entry is omitted in the tables
below.
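
The quantization step can be sketched as a small Python function. The reading of "within 25%" as a symmetric relative tolerance around each target, and the function and constant names, are assumptions made for illustration; the paper does not give the exact rule.

```python
LENGTHS = (4, 10, 20, 40, 100)            # target pattern lengths |P|
TARGETS = (1, 10, 100, 1000, 10000)       # target frequencies k


def bucket(pattern, freq, tol=0.25):
    """Return the (|P|, k) query set for a (pattern, frequency) pair, or None.

    Assumes a frequency agrees with target k when |freq - k| <= tol * k.
    """
    if len(pattern) not in LENGTHS:
        return None
    for k in TARGETS:
        if abs(freq - k) <= tol * k:
            return (len(pattern), k)
    return None


print(bucket("abcd", 9000))    # (4, 10000)
print(bucket("abcd", 7000))    # None
```

With these targets and tolerance the accepted frequency ranges do not overlap, so each pair falls into at most one of the 25 query sets.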



TABLE III

Percentage of index space required by components of
condensed BWT index for Web-4000.

Component                           b = 2«    b = 2^    b = 2 ib
Bitvectors (bf, bl; SD-array)       14.3      22.4      31.7
Condensed BWT (CL; wavelet tree)    5.2       6.5       7.8
Auxiliary information               7.7       8.9       9.7
Pointers (binary)                   72.8      62.2      50.7

D. Compressed Bit Vectors 

A key decision is how to represent the two large bitvectors.
Conceptually each of them contains n bits, but, by
construction, the number of 1 bits is close to the number of
suffix array disk blocks, and so they are sparse and amenable
to compression. The drawback of compression is that rank
and select operations become slower. Figure 6 compares the
space and access cost of three different representations for
the two bitvectors, with space plotted on the horizontal axis,
measured as the size of the complete condensed BWT data
structure as a fraction of the text size; and processing time
per matched character plotted vertically. The alternatives are
denoted by their sdsl class identifier: uncompressed bitvectors
(class bit_vector); the well-known RRR structure [19]
(rrr_vector<63>); and the SD-array (sd_vector<>) of
Okanohara and Sadakane [21]. The SD-array offers the best
balance, and while it is not always faster than the
uncompressed bitvector alternative, it occupies much less space.

TABLE IV

In-memory search structures for variable suffix array blocks.

Data        b      Memory (MB)                       Query speed (microseconds/query)
                   Condensed BWT   Bit-blind tree    Condensed BWT   Bit-blind tree
Web-4000    2^10   269.8           329.1             33.7            36.3
Web-4000    2^12   98.6            112.2             24.3            31.5
Web-4000    2^14   15.8            15.1              19.7            28.6
DBLP-1000   2^10   58.3            58.4              26.8            29.4
DBLP-1000   2^12   21.1            18.9              19.2            24.2
DBLP-1000   2^14   7.6             6.2               15.6            19.9
DNA-3000    2^10   410.2           382.8             29.7            24.3
DNA-3000    2^12   342.9           319.3             21.1            21.0
DNA-3000    2^14   326.6           307.8             17.8            17.3

http://dblp.uni-trier.de/xml/
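
To give a feel for why a sparse bitvector compresses well, the following Python sketch stores the positions of the 1-bits in the high/low split that underlies SD-array style encodings. It is a simplified illustration, not the sdsl sd_vector implementation: select and rank are linear scans, and the class name, method names, and example positions are assumptions.

```python
import math


class SDArray:
    """Sparse bitvector of length n, given the sorted positions of its 1-bits."""

    def __init__(self, positions, n):
        self.m = len(positions)
        # Number of low bits stored verbatim per element: floor(log2(n / m)).
        self.l = int(math.log2(n / self.m)) if self.m else 0
        mask = (1 << self.l) - 1
        self.low = [p & mask for p in positions]
        # High parts are stored in unary: a 0 per increment, a 1 per element.
        self.high = []
        prev = 0
        for p in positions:
            h = p >> self.l
            self.high += [0] * (h - prev) + [1]
            prev = h

    def select(self, k):
        """Position of the (k+1)-th 1-bit, counting k from zero."""
        ones = [i for i, bit in enumerate(self.high) if bit]
        return ((ones[k] - k) << self.l) | self.low[k]

    def rank(self, i):
        """Number of 1-bits in positions [0, i)."""
        return sum(self.select(k) < i for k in range(self.m))

    def bits_used(self):
        return len(self.high) + self.m * self.l


sd = SDArray([3, 70, 71, 500, 900], 1024)
print(sd.select(3), sd.rank(72), sd.bits_used())   # 500 3 47
```

In this example 47 bits represent a 1,024-bit vector with five 1-bits, in line with the z(2 + log(n/z)) form listed for bf and bl in Table I; a production structure adds the o(z) rank and select support.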

Once the bitvectors are compressed, the disk block pointers
are the most costly component of the condensed BWT
index. These are addresses into the index (for irreducible
and reducible blocks) or into the text (for singletons), and
are represented as minimal-width binary numbers. Table III
shows the percentage of the total memory space required by
each of the four main components of the condensed BWT
search structure, for the file Web-4000 and three different
blocksizes. The dominance of the pointers is clear.

E. Baseline Methods and Total Disk Space 

In any experimental comparison it is important to compare
against appropriate reference points. The RoSA structure,
consisting of a condensed in-memory BWT array index and
a reduced set of suffix array blocks stored on disk, can be
compared with the LOF-SA (which in turn is compared by
Sinha et al. [2] against previous data structures); with the
SB-Tree; and with the FM-Index. The FM-Index is not
a two-level disk-based mechanism, and can only be used if
the complete structure fits in main memory. Nevertheless, it is
substantially smaller than the other structures, meaning that
its zone of applicability is larger, and overlaps with the
size range for which two-level structures are appropriate.

Table V compares index sizes for these various approaches,
including both components for the two-level ones. The values
for the RoSA and FM-Index are measured based on our
experimental implementations. There is no software for the
SB-Tree or LOF-SA capable of handling the data sizes used
in our experiments, and the values shown in the table marked
with "*" are computed using Equation 1 (in Section III) for
the SB-Tree, and estimated from the results given by Sinha
et al. [2] for the LOF-SA. With the exception of the
FM-Index, all of these structures require that the text T also be
stored, adding a further 3.9 GB.

TABLE V

Total memory and disk space required for two-level suffix
array structures and the FM-Index, for Web-4000.

Structure                     Ref.          Size (GB)
Suffix array                  11)           15.6
LOF-SA         b = 4,096      [2]           46.9*
LOF-SA         b = 4,096      [3]           27.3*
SB-Tree        b = 4,096      m             24.5*
RoSA           b = 4,096      this paper    7.8
FM-Index                      m             0.6

https://github.com/simongog/sdsl

As can be seen, the block reductions achieved in the RoSA
mean that it is by far the smallest of the two-level
approaches. Indeed, the RoSA index requires just half the
space of a plain suffix array. On the other hand, the
SB-Tree and the LOF-SA are expensive to store; neither of
these structures supports block reductions, and in the case of
the SB-Tree, the LCP values are also a costly component
because the fixed block structure means that they cannot be
stored compressed. Because of their clear space superiority,
the remainder of the experimentation focuses on the RoSA
and the FM-Index alone.

F. Choice of In-Memory Structure 

The second step of the experimental evaluation was to
compare the condensed BWT method with the bit-blind tree,
in terms of memory space required and search time to identify
suffix blocks (Table IV). Search times are measured over
frequently-occurring long queries (|P| = 40 and k = 10,000),
so that the search is driven towards the extremities of the
in-memory structure, and include only the cost of processing
the in-memory data structure.

The two methods are comparable in their space
requirements, with the bit-blind tree sometimes being a little smaller,
and the condensed BWT structure sometimes being a little
smaller. The condensed BWT has a small but consistent
advantage in terms of CPU time. Search in the condensed BWT
structure requires fewer loop iterations than in the bit-blind
tree, but each iteration is more expensive. Note that Table IV
does not include the cost of the disk accesses to T needed
to resolve the uncertainty inherent in the bit-blind search
process. Details of disk access costs are presented shortly;
the condensed BWT arrangement has a clear advantage when
that cost is included.

TABLE VI

Space required by RoSA query-time index components with
b = 4,096, expressed as multiples of the source text size.

File        Memory   Disk    Total, inc. T
Web-256     0.033    1.943   2.976
Web-4000    0.025    1.961   2.986
Web-64000   0.022    1.900   2.922
DBLP-1000   0.020    2.126   3.146
DNA-3000    0.116    4.704   5.820

TABLE VII

Disk accesses per count query for file Web-4000, with
b = 4,096.

Fig. 7. Average size of irreducible blocks (in pointers).

Fig. 8. Fraction of pointers in reducible, irreducible, and singleton blocks
for Web-4000 and different values of b.



G. Blocksizes and Non-Uniform Sampling 

Figure 7 depicts the average number of pointers stored in
each irreducible block for three of the test files. The growth in
average block size is linear in the size of the block, but for the
non-genomic data the average is well below the limit b. This
relationship is not unexpected: blocks are formed at nodes of
the suffix tree whenever the parent has a count of more than
b, but the node in question does not. At that boundary node,
the available symbol count is split across all of the children.
When the alphabet size σ is large, those child counts will, on
average, be relatively small. The same observation explains
why the blocks are larger for the file DNA-3000: when σ
is small, the average frequency count in each child is likely
to be larger.

Figure 8 shows the fraction of the suffix pointers located
in reducible blocks, irreducible blocks, and singleton blocks
for Web-4000. When b is small, more than two thirds of the
suffix pointers are in reducible blocks. That fraction decreases
as b increases, not because the reductions are no longer
present, but because the similar sections no longer span whole
blocks. But even when b = 65,536, around half of the suffix
pointers can be eliminated. Similar behavior was observed for
DBLP-1000. On the other hand, the DNA data has markedly
different characteristics, and while it generates many more
singleton blocks, the number of block reductions is very small.



Table VI shows the balance between in-memory space and
on-disk space required by the RoSA for the full set of data
files. For the web and XML data, the total space required
is much less than would be required by a plain suffix array
(which is a factor of 4.75 for DBLP-1000, and of 5.0 for
Web-4000). On the other hand, the RoSA handles the DNA
data relatively poorly, and both the in-memory index and
the on-disk component are large. Indeed, on the DNA data
the RoSA takes more space than a plain suffix array, a
consequence of the relative absence of repetitions.
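
The plain suffix array factors quoted above can be reproduced with a short calculation. The sketch assumes one byte per text symbol, minimal-width binary pointers of ⌈log n⌉ bits, and file sizes in units of 10^6 bytes; those assumptions and the function name are not stated in the paper, but they yield the quoted values.

```python
import math


def plain_sa_factor(n_bytes):
    """Space of text plus plain suffix array, as a multiple of the text size."""
    pointer_bits = math.ceil(math.log2(n_bytes))   # minimal-width pointer
    return 1 + pointer_bits / 8                    # text itself + one pointer per symbol


for name, size_mb in [("DBLP-1000", 1032), ("Web-4000", 4000), ("Web-64000", 64002)]:
    print(name, plain_sa_factor(size_mb * 10**6))
# DBLP-1000 4.75
# Web-4000 5.0
# Web-64000 5.5
```

The same calculation gives 5.0 for DNA-3000, against the RoSA total of 5.820 in Table VI.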

H. Disk Accesses and Execution Cost For Count Queries 

Table VII shows the number of disk accesses required by
the two options for the in-memory structure. The benefit of the
condensed BWT arrangement is clear: because it admits no
ambiguity, fewer disk accesses are required for count queries
when the pattern is common in the text and can be resolved
entirely within the in-memory index. When the pattern is
frequent, the discrepancy is even greater: the condensed BWT
allows count queries to be processed without recourse to disk,
whereas the bit-blind tree still requires an average of more
than 1.8 disk accesses per query.



Table VIII shows overall elapsed times for a range of query
lengths and frequencies across the set of data files (including
the 64 GB file), and for two hardware platforms. The
in-memory condensed BWT index for Web-64000 requires
1.39 GB (around two-thirds of which is pointers, as shown
in the final column of Table I), and the on-disk part a total
of 119 GB, with the latter composed of 1.4 GB for block
headers and other auxiliary data; 29.5 GB for compressed LCP
differentials and for tree structure bits; and 82.7 GB for suffix
pointers. Including the text T, the entire search system requires
183 GB, a factor of 2.9 relative to the text, and only a little
over half of the 5.5-factor that would be required by a simple
suffix array, not even including any allowance for LCP values.

TABLE VIII

Execution times in milliseconds per query, using two different hardware platforms, with b = 4,096.

MacBook Air, SSD    0.07    0.14    0.36
Web-64000    MacBook Air, SSD    44.6    85.6
1450
2500

As can be seen, access via SSD memory is much faster than
access via mechanical disk. But even with the mechanical disk,
pattern queries on Web-64000 can be answered by the RoSA
in under 50 milliseconds. Moreover, search times are largely
unaffected by pattern length, except that queries on
frequently-occurring strings are always handled within a small number
of microseconds.

I. Compared to the FM-Index



The last three rows of Table VIII show the query cost of
a highly-tuned (for both space and speed) FM-Index
implementation that has been demonstrated to outperform other
available packages [6, Section 6.6]. For Web-4000, a
run-length compressed wavelet tree and SD-array implementations
for the two FM-Index bitvectors were used, the fastest
configuration. During querying, this FM-Index version requires
659.4 MB of memory space. For short count queries it is much
faster than the RoSA. With a different bitvector representation
(using the RRR variant), space can be reduced to 404.6 MB,
but querying time increases by a factor of around three.

For Web-64000 (the last two lines of Table VIII), the more
compact RRR bitvector option was used, requiring 8.3 GB for
the index. As can be seen, when only a subset of a large index
can be maintained permanently in memory, the non-sequential
access pattern means that retrieval times increase dramatically.
When SSD disk is used the times are still somewhat plausible,
but the two-second response times that arise when a
mechanical disk is used are anything but plausible. The sequence of
results in Table VIII clearly highlights the situations for which
the RoSA is the fastest search mechanism.

VII. Discussion 

We conclude by comparing the RoSA with other large-scale 
search mechanisms that have been presented in the literature. 



A. Construction and Applicability 

Despite recently developed techniques [22], a drawback
of all suffix array-based pattern search methods is the cost
of building the suffix array. The structures used in our
experiments were generated on a server with considerably
more memory than the laptops that were used for the search
experiments, and reflect the situation for which we believe
static two-level structures are best suited: namely, when large
fixed texts are to be pre-processed by a central service to make
"searchable packages" that can be distributed onto low-cost
devices for querying purposes.

The FM-Index is a strong competitor for the same type
of applications. It has approximately the same construction
cost, but a much smaller query-time disk storage footprint.
The disadvantage of using an FM-Index is that for any given
text T, its memory requirement is likely to be greater than that
of the RoSA, because the entire structure must be present in
memory. That is, there is a size of text for which an
FM-Index cannot be supported by the available hardware, but
a RoSA can, albeit with significantly greater disk storage
consumption. Depending on the exact configuration used,
locate and context queries might also be slower in an
FM-Index than using the RoSA.

It is also interesting to calculate the break-even point at
which a pre-computed data structure becomes more
economical than sequential search. Construction of the RoSA for
Web-4000 requires around 100 minutes, and the current
implementation involves a peak memory requirement of 9n
bytes during the two suffix sorting steps (external methods
for suffix sorting are available that reduce the memory cost,
but increase the construction time). Using the MacBook Pro
to search the same 4 GB file for patterns using agrep
requires about three seconds, once the file containing T has
been brought into memory. Hence, construction of a RoSA
index is warranted if more than around 2,000 queries are to
be processed against the same text T.
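
The break-even figure follows from dividing the construction time by the per-query saving. The short Python sketch below makes that arithmetic explicit; the function name is illustrative, and the optional per-query index cost defaults to zero, which matches the rough estimate in the text rather than a measured value.

```python
def break_even_queries(build_seconds, scan_seconds, index_seconds=0.0):
    """Number of queries at which building an index pays for itself."""
    return build_seconds / (scan_seconds - index_seconds)


# Around 100 minutes of construction against about three seconds per agrep scan.
print(break_even_queries(100 * 60, 3.0))   # 2000.0
```

Allowing for a non-zero per-query cost on the index side raises the break-even point only slightly, because indexed queries are far cheaper than a sequential scan.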

B. Other Recent Work 

Phoophakdee and Zaki [23] describe a partition/merge
approach to suffix tree construction that allows them to undertake
pattern search on a human genome. They compare their
Trellis approach to other options on files of up to three
billion DNA base pairs, with a build time of under six hours,
and a final size of 71.6 GB, or 27 times larger than the input
text. Using their suffix tree, they are able to undertake queries
of 100+ base pairs in approximately 60 milliseconds.

ftp://ftp.cs.arizona.edu/agrep/

Wong et al. [24] describe a partitioned suffix tree they call
a CPS-Tree. They experiment with files of 118 million base
pairs and 4.6 million base pairs, and obtain suffix trees that 
require between In and 9n bytes. With these small test files, 
querying is fast - of the order of 20 microseconds per query 
- because it still takes place in main memory. 

Orlandi and Venturini [25] have also described a structure
for storing a pruned suffix tree. Their pruning definition differs
from the one used in the RoSA, and they retain a node if its
size is greater than b, whereas in the RoSA a node appears in
the condensed BWT structure if its parent is of size greater
than b. The difference means that care must be taken when
comparing sizes for a given parameter value, since the RoSA
retains as many as σ times more tree nodes than does the
CPST, including, for example, singleton blocks.

For a CPST over n symbols in which there are K suffix tree
nodes retained each of size b or more, the space required by
Orlandi and Venturini's structure is O(K log(σb) + σ log n)
bits. Direct comparison with the costs shown in Table I is
not possible, because for any given value of b the number of
nodes K in the CPST is much less than the number of leaves
B in the RoSA index structure. The RoSA's condensed BWT
index provides greater functionality, since it retains frequency
counts for (lb, rb) intervals narrower than b, whereas the
CPST replies to locate and count queries on rare and
non-existent patterns with a uniform answer of "don't know; if P
does exist, it appears fewer than b times". The RoSA also
stores disk block pointers, a component that is not required
in the CPST. Orlandi and Venturini [25] also describe a
uniform-sampling index in order to undertake approximate
count queries, where the returned pattern frequency in count
queries is correct to within an additive fidelity constraint
determined at the time the index is constructed. Building a
CPST requires initial construction of a suffix tree, and needs
more resources than creation of the BWT string, the basis of
the RoSA's construction process.

Other recent work is by Ferguson [26], who describes a 
search structure called FEMTO, and provides experiments 
on 43 GB of English text (Project Gutenberg files), and 
on 182 GB of genomic data. The FEMTO system uses 
a partitioned FM-INDEX, with the search for each pattern 
proceeding through (at least) one disk block per symbol. 
Ferguson gives experimental results showing that the 
constructed index requires as little as half of the space of the 
original file, but with query response times of 1-3 seconds 
for count queries against selected patterns of 12-28 symbols 
against the English text when using a conventional disk drive, 
and of 10 or more seconds when searching the genomic data 
for patterns of length 128. The English patterns were two to 
three word phrases, tested individually on hand-selected 
strings rather than as part of a regime of extensive 
measurement. The high search times arise because of the disk 
accesses. When multiple queries are simultaneously active, 
and duplicate requests for disk blocks can be batched and 
processed all at once, throughput improves dramatically, but 
with a corresponding increase in individual response times. 
Compared to FEMTO, the methods presented here require 
more disk space for the suffix array data, but operate an order 
of magnitude more quickly. 
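
The per-symbol cost noted above follows from the way an FM-INDEX is searched: backward search performs one rank step per pattern symbol, and in a partitioned on-disk index each step may touch a different block. The sketch below is a generic in-memory illustration of backward search, not FEMTO's implementation; the function names, the naive rank computed with `str.count`, and the `"\0"` sentinel are all assumptions made for brevity.

```python
def bwt_from_text(text):
    """Build the BWT via a naive suffix sort (fine for tiny inputs only)."""
    s = text + "\0"  # sentinel, assumed smaller than every text symbol
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    return "".join(s[i - 1] for i in sa)


def backward_count(bwt, pattern):
    """Return (occurrence count, number of rank steps performed)."""
    # C[c] = number of symbols in the text strictly smaller than c.
    C = {}
    for i, c in enumerate(sorted(bwt)):
        C.setdefault(c, i)
    lo, hi = 0, len(bwt)  # half-open range of matching suffixes
    steps = 0
    for c in reversed(pattern):
        if c not in C:
            return 0, steps
        # Naive rank: occurrences of c in bwt[:lo] and in bwt[:hi].
        lo = C[c] + bwt.count(c, 0, lo)
        hi = C[c] + bwt.count(c, 0, hi)
        steps += 1
        if lo >= hi:
            return 0, steps
    return hi - lo, steps


if __name__ == "__main__":
    bwt = bwt_from_text("abracadabra")
    print(backward_count(bwt, "abra"))
```

In this sketch a pattern of m symbols that occurs in the text always incurs m rank steps, which is why pattern length translates directly into block accesses when the rank structures are held on disk.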

Another approach to large-scale pattern search is to index 
overlapping t-grams from T, each containing t consecutive 
symbols. In total, $n - t + 1$ locations in T are indexed via 
a vocabulary containing at most $O(\sigma^t)$ entries. An inverted 
index is built, storing a variable-length postings list for each 
unique t-gram, and recording the locations in T at which that 
particular combination of t symbols appears [27]. Queries of 
length m > t are resolved by intersecting the relevant postings 
lists, identifying locations at which fragments overlap in the 
desired manner; queries of length m < t are resolved by taking 
the union of the postings lists of the vocabulary entries that 
contain P within the t-symbol identifier. 
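
The t-gram scheme just described can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the design of any of the cited systems: postings lists are plain Python lists, the function names are hypothetical, and the text is assumed to contain at least t symbols. Long patterns are covered by t-grams at offsets 0, t, 2t, and so on, with one final overlapping t-gram when m is not a multiple of t.

```python
from collections import defaultdict


def build_tgram_index(text, t):
    """Map each t-gram to the sorted list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - t + 1):
        index[text[i:i + t]].append(i)
    return index


def locate(index, t, pattern):
    """Return the sorted start positions of pattern in the indexed text."""
    m = len(pattern)
    if m >= t:
        # Cover the pattern with t-grams, then intersect the postings
        # lists after shifting each one back by its offset in the pattern.
        offsets = list(range(0, m - t + 1, t))
        if offsets[-1] != m - t:
            offsets.append(m - t)  # final, overlapping t-gram
        result = None
        for off in offsets:
            postings = index.get(pattern[off:off + t], [])
            starts = {p - off for p in postings}
            result = starts if result is None else result & starts
        return sorted(result)
    # m < t: union over every vocabulary entry that contains the pattern.
    result = set()
    for gram, postings in index.items():
        j = gram.find(pattern)
        while j != -1:
            result.update(p + j for p in postings)
            j = gram.find(pattern, j + 1)
    return sorted(result)


if __name__ == "__main__":
    idx = build_tgram_index("abracadabra", 3)
    print(locate(idx, 3, "abra"))
    print(locate(idx, 3, "ra"))
```

For m of at least t, the sketch fetches one postings list per covering t-gram. For m less than t it scans the whole vocabulary, which is only reasonable when the vocabulary is small enough to be held in memory.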

Inverted indexes allow queries to be resolved in two disk 
accesses per query term, one to retrieve a block of the 
vocabulary, and one to retrieve a postings list [27]. If t is 
chosen so that the t-gram vocabulary for T can be held in 
main memory, the number of disk accesses required to match 
a pattern P and resolve locate queries is $\lceil m/t \rceil$. In terms of 
space, a t-gram index with $t \approx 5$ to 10 can be expected to 
consume around 150-200% of the space required by T, and 
to grow larger as t increases. Note that in the t-gram approach 
to pattern search T is not required in memory. 

Tang et al. [28] give details of the construction and use of 
n-gram indexes for pattern matching. Puglisi et al. [29] have 
also examined this problem. 

VIII. Summary 

We have carried out a detailed investigation of two-level 
suffix-array based pattern search mechanisms, and: (1) 
described an efficient mechanism for exploiting whole block 
reductions, to approximately halve the space required by the 
suffix array pointers; (2) described and analyzed a condensed 
BWT mechanism for storing and searching the string labels 
of a pruned suffix tree; and (3) described a comprehensive 
approach to testing pattern search mechanisms. We have 
demonstrated that in combination the new techniques provide 
efficient large-scale pattern search, requiring around half the 
disk space of previous two-level techniques, and providing 
faster search than an FM-INDEX when the data is such that 
the FM-INDEX cannot be accommodated in main memory. 
While we have focused on the memory-disk interface, we note 
that structures with the properties exhibited by the RoSA are 
effective across all interface levels in the memory hierarchy. 